In this example I will use QGIS geoprocessing scripts through the qgisprocess library in R. I will use data from the 2005 Peru Demographic and Health Survey, and data from the Peruvian government on the locations of secondary schools.

We use buffers from each DHS primary sampling unit location and point in polygon operations to measure whether a community had a secondary school within 5km.

Then, we use a hierarchical model to test whether a woman’s educational attainment is related to physical access to secondary schooling.

This is an example of a derived variable that cannot be obtained without the use of the GIS.

library(sf)
## Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1
library(qgisprocess) 
## Using 'qgis_process' at 'C://OSGeo4W64/bin/qgis_process-qgis.bat'.
## Run `qgis_configure()` for details.
library(tmap)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
qgis_configure()
## getOption('qgisprocess.path') was not found.
## Sys.getenv('R_QGISPROCESS_PATH') was not found.
## Trying 'qgis_process' on PATH
## Error in rethrow_call(c_processx_exec, command, c(command, args), stdin, : Command 'qgis_process' not found @win/processx.c:994 (processx_exec)
## Found 2 QGIS installations containing 'qgis_process':
##  C://OSGeo4W64/bin/qgis_process-qgis.bat
## C://OSGeo4W64/bin/qgis_process-qgis-dev.bat
## Trying command 'C://OSGeo4W64/bin/qgis_process-qgis.bat'
## Success!

Read in the DHS sampling unit locations

peru_dhs_points<-st_read("~/OneDrive - University of Texas at San Antonio/projects/LAPIVpaper/PEGE52FL/PEGE52FL.shp", "PEGE52FL")
## Reading layer `PEGE52FL' from data source `C:\Users\ozd504\OneDrive - University of Texas at San Antonio\projects\LAPIVpaper\PEGE52FL\PEGE52FL.shp' using driver `ESRI Shapefile'
## Simple feature collection with 1414 features and 20 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -81.30633 ymin: -18.22461 xmax: 0 ymax: 0
## geographic CRS: WGS 84
peru_dhs_points<-st_transform(peru_dhs_points, crs=24892)

#project the data into a meter based system
peru_dhs_points<-peru_dhs_points%>%
  filter(LATNUM<0)

Point in Polygon analysis

So, right now we only have points (locations of sampling units), so we need a buffer around each point in order to do our assessment of whether a school is within 5km of each community. To do this we will do a fixed distance buffer of 5km around each sampling location.

Buffer analysis

Here I do a 5km buffer around each PSU location.

peru_buff<-st_buffer(peru_dhs_points, dist = 5000)

now we have our parameters defined, we run the script:

tmap_mode("view")
## tmap mode set to interactive viewing
tm_basemap("OpenStreetMap.Mapnik")+
  tm_shape(peru_buff)+
  tm_polygons("DHSCLUST")+
  tm_shape(peru_dhs_points)+
  tm_dots()

5km buffers done! Now, we need to do our point in polygon operation. So I read in the Peruvian school data. These are a csv file, but have lat/long specified, so I can read in the text and make a simple feature layer from it. You can’t have missing information in the coordinate data to do this.

library(dplyr)
schools<-read.csv("~/OneDrive - University of Texas at San Antonio//classes/dem7093/dem7093_21//data/perusecondaryschools.csv", header=T)
schools<-schools%>%
  filter(complete.cases(Latitud, Longitud))

persch<-st_as_sf(schools, coords=c("Longitud", "Latitud"), crs=4326,agr="constant")
persch<-st_transform(persch,crs=24892 )


tm_basemap("OpenStreetMap.Mapnik")+
  #tm_shape(peru_buff)+
  #tm_polygons("DHSCLUST")+
  tm_shape(persch)+
  tm_dots("Departamento")

points in polygons

See what the right script is, and what arguments it needs:

peru_buff$nsch <- lengths(st_intersects(peru_buff, persch))
hist(peru_buff$nsch)

peru_buff$closeschool<-ifelse(peru_buff$nsch>0,1,0)
summary(peru_buff$closeschool)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  1.0000  1.0000  0.9326  1.0000  1.0000

So we see that 93.2% of communities have a secondary school within 5km, that’s great, but what about the communities that don’t?

Merge the new location data to our survey

Here we load the 2005 DHS data, and do some recodes for educational attainment, age, and get the survey design variables.

library(haven)
library(survey)
## Loading required package: grid
## Loading required package: Matrix
## Loading required package: survival
## 
## Attaching package: 'survey'
## The following object is masked from 'package:graphics':
## 
##     dotchart
library(dplyr)

peru_data<-read_dta("~/OneDrive - University of Texas at San Antonio/projects/LAPIVpaper/DHS_IPVdata/PEIR51FL.DTA")

peru_data2<-peru_data%>%
  mutate(eduprim=ifelse(v106==1,1,0),
  edusecplus=ifelse(v106!=9&v106>=2,1,0),
  pwt=v005/1000000, 
  modcontra = ifelse(v313==3, 1, 0),
  purchdes = car::Recode(v743b, recodes = "1:3=1;4:6=0;0=NA;else=NA"),
  age=v012,
  knowhiv = ifelse(v751==1, 0, 0),
  ipv.viol=ifelse(d105a==1|d105a==2|d105b==1|d105b==2| d105c==1|d105c==2 | d105d==1|d105d==2| d105e==1|d105e==2| d105g==1|d105g==2,1,0))%>%
  dplyr::select(v000, v021, v022,v024,age, pwt, eduprim, edusecplus, modcontra, purchdes, knowhiv, ipv.viol )

head(peru_data2)
library(survey)

Now we merge the survey to the spatial data:

peru_merge<-left_join(peru_data2, peru_buff,by=c("v021"="DHSCLUST"))

In order to test for the effect of school access on women’s educational attainment, we use a binomial generalized linear model.

des<-svydesign(ids=~v021, strata=~v022, weights=~pwt, data=peru_merge, nest = T)

fit<-svyglm(edusecplus~closeschool+age+I(age^2), family = binomial, design=des)
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
summary(fit)
## 
## Call:
## svyglm(formula = edusecplus ~ closeschool + age + I(age^2), design = des, 
##     family = binomial)
## 
## Survey design:
## svydesign(ids = ~v021, strata = ~v022, weights = ~pwt, data = peru_merge, 
##     nest = T)
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.0569699  0.2787120   7.380 2.73e-13 ***
## closeschool  1.7162034  0.1874376   9.156  < 2e-16 ***
## age         -0.1210658  0.0136170  -8.891  < 2e-16 ***
## I(age^2)     0.0009932  0.0002075   4.787 1.88e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1.058238)
## 
## Number of Fisher Scoring iterations: 4

So, we see in this case that if women lived within 5km of a secondary school, they are much more likely to have a secondary education or more, compared to woman living in a community without a school within 5km.